Section 1

INTRODUCTION  AND  SUMMARY 

Let $X$ be a $p$-component random vector variable with transpose
$X' = [X_1, \ldots, X_j, \ldots, X_p]$ and $Y$ be a one-dimensional random variable with a
joint continuous distribution of density $f(x,y)$. The regression of $Y$ given $X = x$ is

$$E[Y \mid X = x] = \frac{\int_{-\infty}^{\infty} y\, f(x,y)\, dy}{\int_{-\infty}^{\infty} f(x,y)\, dy} \tag{1.1}$$


Using the consistent nonparametric estimators of the densities described later in the
paper, the regression can be estimated by the estimated regression

$$\hat{Y}(x) = \frac{\sum_{i=1}^{n} Y_i \exp\left[-\frac{1}{2\sigma^2}\sum_{j=1}^{p}(X_{ji} - x_j)^2\right]}{\sum_{i=1}^{n} \exp\left[-\frac{1}{2\sigma^2}\sum_{j=1}^{p}(X_{ji} - x_j)^2\right]} \tag{1.2}$$

where

$n$ = number of observations of $X$ and $Y$ used in estimating the densities
$X_i$ and $Y_i$ = the $i$th observations of $X$ and $Y$, respectively
$X_i'$ = $[X_{1i}, \ldots, X_{ji}, \ldots, X_{pi}]$
$\sigma$ = a smoothing parameter
It is further shown that the nonlinear regression equation (1.2) can be approximated to
any desired accuracy by a polynomial-ratio regression estimate

$$Y^*(x) = \frac{Q(x)}{P(x)} \tag{1.3}$$


where the coefficients of the polynomials are computed as a function of the observed
sample. The advantage of the form Eq. (1.3) is that the observations are used only
in the computation of the coefficients. Subsequent evaluation of $\hat{Y}(x)$ for a given
vector $x$ is usually much faster using Eq. (1.3) rather than Eq. (1.2).

This same advantage is, of course, shared by classical polynomial regression equations,
but the technique described has the following advantages compared with classical
polynomial regression techniques utilizing a single polynomial:

•  It provides a simple method of determining the coefficients. The calculation
for the coefficient of a particular term amounts to little more than averaging
the corresponding product of variables over the set of observations available.

•  The computational and storage requirements increase only linearly with the
number of coefficients used.

•  The shape of the regression surface can be made as complex as necessary to
closely approximate Eq. (1.1), or as simple as desired, by proper choice of
the smoothing parameter $\sigma$. In spite of this flexibility, $Y^*(x)$ estimated
from Eq. (1.3) is bounded by the minimum and maximum of the observations
$Y_i$ when $Q(x)$ and $P(x)$ are truncated at an even order.

•  Because of the smoothing properties inherent in the density estimator, the
number of coefficients used in the polynomials can approach or even exceed
the number of observations in the sample with no danger of the regression
surface overfitting the data when $\sigma$ is suitably chosen.

Since  the  derivation  of  Eq.  (1.3)  from  Eq.  (1.2)  is  not  dependent  on  X  being  random, 
the  computational  advantages  of  the  polynomial  form  can  be  utilized  also  when  values 
of  X  are  specified  in  the  design  of  an  experiment  and  Y  alone  is  a  random  variable. 
This  property  applies,  of  course,  to  ordinary  polynomial  regression  as  well. 
Section 2

GENERAL  REGRESSION 


Let $X$ be a $p$-component random vector variable with transpose
$X' = [X_1, \ldots, X_j, \ldots, X_p]$, and let $Y$ be a random variable with a joint
continuous distribution of density $f(x,y)$. The conditional mean of $Y$ given $X = x$,

$$E[Y \mid X = x] = \frac{\int_{-\infty}^{\infty} y\, f(x,y)\, dy}{\int_{-\infty}^{\infty} f(x,y)\, dy} \tag{2.1}$$

is also called the regression of $Y$ on $X$ and is determined by the density $f$.

When the density $f(x,y)$ is not known, it must usually be estimated from a sample of
observations of $X$ and $Y$. We estimate the regression by taking the regression of a
nonparametric estimate of $f(x,y)$. The class of consistent estimators proposed by
Parzen (Ref. 1) and shown to be applicable to the multidimensional case by Cacoullos
(Ref. 2) is suitable for this purpose. For reasons expressed in Refs. 3 and 4, the
particular estimator

$$\hat{f}(x) = \frac{1}{n\,(2\pi)^{p/2}\sigma^{p}} \sum_{i=1}^{n} \exp\left[-\frac{(X_i - x)'(X_i - x)}{2\sigma^2}\right] \tag{2.2}$$

seems to be a good choice for estimating a probability density function $f$ if it can
reasonably be assumed that the underlying density is continuous and that its first
partial derivatives evaluated at any $x$ are small.


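Letting $(X^*)' = [X_1, \ldots, X_p, Y]$ and $(x^*)' = [x_1, \ldots, x_p, y]$, the application of
Eq. (2.2) yields the estimated regression

$$\begin{aligned}
\hat{Y}(x) &= \frac{\int_{-\infty}^{\infty} y\, \hat{f}(x^*)\, dy}{\int_{-\infty}^{\infty} \hat{f}(x^*)\, dy} \\
&= \frac{\int_{-\infty}^{\infty} y \sum_{i=1}^{n} \exp\left[-\frac{1}{2\sigma^2}(X_i^* - x^*)'(X_i^* - x^*)\right] dy}{\int_{-\infty}^{\infty} \sum_{i=1}^{n} \exp\left[-\frac{1}{2\sigma^2}(X_i^* - x^*)'(X_i^* - x^*)\right] dy} \\
&= \frac{\sum_{i=1}^{n} \exp\left[-\frac{1}{2\sigma^2}(X_i - x)'(X_i - x)\right] \int_{-\infty}^{\infty} y \exp\left[-\frac{(Y_i - y)^2}{2\sigma^2}\right] dy}{\sum_{i=1}^{n} \exp\left[-\frac{1}{2\sigma^2}(X_i - x)'(X_i - x)\right] \int_{-\infty}^{\infty} \exp\left[-\frac{(Y_i - y)^2}{2\sigma^2}\right] dy}
\end{aligned}$$

Letting

$$\Delta_i^2 = (X_i - x)'(X_i - x) = \sum_{j=1}^{p} (X_{ji} - x_j)^2 \tag{2.3}$$

and performing the indicated integrations,

$$\hat{Y}(x) = \frac{\sum_{i=1}^{n} Y_i \exp(-\Delta_i^2/2\sigma^2)}{\sum_{i=1}^{n} \exp(-\Delta_i^2/2\sigma^2)} \tag{2.4}$$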


Note that because the particular estimator Eq. (2.2) is readily decomposed into $X$
and $Y$ factors, the integrations were accomplished analytically. The resulting
regression equation (2.4), which involves summations over the observations, is directly
applicable to problems involving numerical data.
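As an illustration of how Eq. (2.4) applies to numerical data, the following is a minimal Python sketch, assuming NumPy is available. The function name `general_regression` and its argument layout are hypothetical and not part of the original report. Subtracting the smallest squared distance before exponentiating is an added numerical safeguard; it scales numerator and denominator by the same factor and so leaves the ratio unchanged.

```python
import numpy as np

def general_regression(x, X, Y, sigma):
    """Estimate E[Y | X = x] by the weighted average of Eq. (2.4).

    X: (n, p) array of observed vectors X_i; Y: (n,) array of observed Y_i;
    x: length-p query point; sigma: smoothing parameter (> 0).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x = np.asarray(x, dtype=float)
    # Squared Euclidean distances of Eq. (2.3), one per observation.
    d2 = np.sum((X - x) ** 2, axis=1)
    # Shift by the minimum distance: a common factor in numerator and
    # denominator, so the ratio is unchanged but 0/0 underflow is avoided.
    w = np.exp(-(d2 - d2.min()) / (2.0 * sigma ** 2))
    return float(np.dot(w, Y) / np.sum(w))

# Small usage example with three one-dimensional observations.
X_obs = np.array([[0.0], [1.0], [2.0]])
Y_obs = np.array([0.0, 1.0, 4.0])
print(general_regression([1.5], X_obs, Y_obs, sigma=0.5))
```

Because the result is a weighted average with positive weights, it always lies between the smallest and largest observed $Y_i$.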
Although $\hat{Y}(x)$ will be expressed in terms of series expansions to introduce a
computationally more efficient approximation in the next section, the main properties of
$\hat{Y}(x)$ are evident in the present form.

Parzen (Ref. 1) and Cacoullos (Ref. 2) have shown that the density estimator $\hat{f}(x)$
[Eq. (2.2)] used in estimating Eq. (2.1) by Eq. (2.4) is a consistent estimator
[asymptotically converges to the underlying probability density function $f(x)$] at all
points $x$ at which the density function is continuous, providing that $\sigma = \sigma(n)$ is chosen
as a function of $n$ such that

$$\lim_{n \to \infty} \sigma(n) = 0$$

and

$$\lim_{n \to \infty} n\,\sigma^{p}(n) = \infty$$

The estimate $\hat{Y}(x)$ can be visualized as a weighted average of all of the observed values
$Y_i$ where each observed value is weighted exponentially according to its Euclidean
distance from $x$. When the smoothing parameter $\sigma$ is made large, the estimated
density is forced to be smooth and in the limit becomes multivariate Gaussian with
covariance $\sigma^2 I$. On the other hand, small $\sigma$ allows the estimated density to assume
non-Gaussian shapes but with the hazard that wild points may have too great an effect
on the estimate. As $\sigma \to \infty$, $\hat{Y}(x)$ assumes the value of the sample mean of the
observed $Y_i$, and as $\sigma \to 0$, $\hat{Y}(x)$ assumes the value of the $Y_i$ associated with the
observation closest to $x$.* (This case is treated in more detail in Ref. 5.) For
intermediate values of $\sigma$, all values of $Y_i$ are taken into account, but those
corresponding to points nearer to $x$ are given heavier weight.

*Consider two observations $(X_1, Y_1)$ and $(X_2, Y_2)$ such that $\Delta_2^2 = \Delta_1^2 + \epsilon$, $\epsilon > 0$ for
some value of $x$. Then from Eq. (2.4),
$\hat{Y}(x) = [Y_1 + Y_2 \exp(-\epsilon/2\sigma^2)] / [1 + \exp(-\epsilon/2\sigma^2)]$, which approaches $Y_1$ as $\sigma \to 0$.

When the underlying parent distribution is not known, it is not possible to compute an
optimum $\sigma$ for a given number of observations $n$. It is therefore necessary to find
$\sigma$ on an empirical basis. This can be done quite easily when the density estimate is
being used in a regression equation because there is a natural criterion which can be
used for evaluating each value of $\sigma$; namely, the correlation between $Y_i$ and the
estimate $\hat{Y}(X_i)$ for each of the observed samples. One precaution is necessary,
however. For this purpose $\hat{Y}(X_i)$ must be modified to

$$\hat{Y}(X_i) = \frac{\sum_{j \ne i} Y_j \exp(-\Delta_j^2/2\sigma^2)}{\sum_{j \ne i} \exp(-\Delta_j^2/2\sigma^2)}$$

with $\Delta_j^2 = (X_j - X_i)'(X_j - X_i)$, so that each $\hat{Y}(X_i)$ is based on inference from
all the observations except the actual observed value at $X_i$. This procedure is used
to avoid an artificial maximum correlation as $\sigma \to 0$ which results when the estimated
density is allowed to fit the observed data points. (Overfitting of the data is also
present in the least-squares estimation of linear regression surfaces, but is not as
severe because the linear regression equation has only $p + 1$ degrees of freedom. If
$n \gg p$, the phenomenon of overfitting can be and is commonly ignored in least-squares
regression.)
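The empirical choice of the smoothing parameter described above can be sketched in Python as follows, assuming NumPy. The names `loo_estimates` and `select_sigma`, and the idea of scanning a fixed list of candidate values, are illustrative assumptions; the report specifies only the criterion (correlation between $Y_i$ and the hold-one-out estimate), not a search procedure. The sketch needs at least two observations.

```python
import numpy as np

def loo_estimates(X, Y, sigma):
    """Hold-one-out estimates: the estimate at X_i uses every observation except i."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.sum(diff ** 2, axis=2)            # pairwise squared distances
    np.fill_diagonal(d2, np.inf)              # exp(-inf) = 0 drops the term j = i
    d2 = d2 - d2.min(axis=1, keepdims=True)   # per-row rescaling; ratios unchanged
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return (w @ Y) / w.sum(axis=1)

def select_sigma(X, Y, candidates):
    """Return the candidate with the highest correlation between Y_i and its
    hold-one-out estimate, together with all the correlations."""
    scores = [float(np.corrcoef(Y, loo_estimates(X, Y, s))[0, 1]) for s in candidates]
    return candidates[int(np.argmax(scores))], scores

# Usage on synthetic data (a noisy sine curve, chosen only for illustration).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 2.0 * np.pi, size=(200, 1))
Y_demo = np.sin(X_demo[:, 0]) + 0.5 * rng.standard_normal(200)
best, scores = select_sigma(X_demo, Y_demo, [0.01, 0.1, 0.3, 1.0, 3.0])
print(best, scores)
```

Setting the diagonal distances to infinity is one simple way to exclude each point's own observation, which is the precaution the text calls for.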
Section 3

A  POLYNOMIAL  EQUIVALENT 


3.1  DERIVATION 

In Ref. 4 it was shown that since the density estimator $\hat{f}(x)$ can be written

$$\hat{f}(x) = (2\pi\sigma^2)^{-p/2} \exp[-(x'x)/2\sigma^2]\, \frac{1}{n} \sum_{i=1}^{n} \exp[x'X_i/\sigma^2] \exp[-X_i'X_i/2\sigma^2]$$

it can be replaced by a polynomial approximation based on a Taylor's series expansion
of $\exp[x'X_i/\sigma^2]$. In many circumstances, this approximation requires substantially
less computation than $\hat{f}(x)$ [Eq. (2.2)]. The polynomial version of the estimator has
the form

$$f^*(x) = (2\pi\sigma^2)^{-p/2} \exp[-(x'x)/2\sigma^2]\, P_\ell(x) \tag{3.1}$$


where

$$P_\ell(x) = \sum_{J \le \ell} a_{j_1 \cdots j_p}\, x_1^{j_1} x_2^{j_2} \cdots x_p^{j_p}, \qquad j_k \ge 0, \quad J = j_1 + j_2 + \cdots + j_p \tag{3.2}$$

The coefficients $a_{j_1 \cdots j_p}$ are computed from the observations $X_i$ using

$$a_{j_1 \cdots j_p} = \frac{1}{n\,\sigma^{2J}\, j_1! \cdots j_p!} \sum_{i=1}^{n} X_{1i}^{j_1} \cdots X_{pi}^{j_p} \exp[-(X_i'X_i/2\sigma^2)] \tag{3.3}$$

and the sum in Eq. (3.2) is over all $j_1, \ldots, j_p$ for which $J \le \ell$. Note that each
coefficient $a_{j_1 \cdots j_p}$ involves the $i$th observation $X_i$ only in one of a sum of $n$ terms.

Although the generality of the notation used makes the equations look formidable,
consider the coefficient of a specific term in Eq. (3.2) such as the coefficient $a_{110 \cdots 0}$
of $x_1 x_2$. Then

$$a_{110 \cdots 0} = \frac{1}{n\sigma^4} \sum_{i=1}^{n} X_{1i} X_{2i} \exp(-X_i'X_i/2\sigma^2)$$

In words, this equation says to take the average of the products of the cross products
$X_{1i}X_{2i}$ and a "normalizing factor" $\exp(-X_i'X_i/2\sigma^2)$; then to multiply this average by
a "premultiplying constant" $1/\sigma^4$ (the factor $1/n$ is absorbed by the averaging). Each
term has its own premultiplying constant. Note, however, that all terms for an
observation have the same normalizing factor.

The normalizing factor, therefore, need be calculated only once for each observation,
regardless of the number of coefficients used in the polynomial $P_\ell(x)$. Considering
this circumstance, and also the fact that the premultiplying constant is not
data-dependent, the algorithm implied by Eq. (3.3) amounts to little more computation than
simply making each coefficient equal to the mean of the corresponding cross product
over the observation set used for establishing the coefficients.

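Note that the regression equation (2.4) can be written

$$\hat{Y}(x) = \frac{(2\pi\sigma^2)^{-p/2}\, \frac{1}{n} \sum_{i=1}^{n} Y_i \exp(-\Delta_i^2/2\sigma^2)}{\hat{f}(x)} \tag{3.4}$$
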
6-79-68-6 


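A polynomial approximation for the denominator of Eq. (3.4) has just been given; a
similar polynomial approximation can be derived for the numerator of this expression.
When these are both used in Eq. (3.4), we approximate $\hat{Y}(x)$ by the polynomial-ratio
regression estimate

$$Y^*(x) = \frac{Q_\ell(x)}{P_\ell(x)} \tag{3.5}$$

where $Q_\ell(x)$ has identically the same form as $P_\ell(x)$ except that the coefficients are
computed by

$$b_{j_1 \cdots j_p} = \frac{1}{n\,\sigma^{2J}\, j_1! \cdots j_p!} \sum_{i=1}^{n} Y_i\, X_{1i}^{j_1} \cdots X_{pi}^{j_p} \exp[-(X_i'X_i/2\sigma^2)] \tag{3.6}$$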
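The coefficient computation of Eqs. (3.3) and (3.6) and the evaluation of the ratio in Eq. (3.5) can be sketched in Python as follows, assuming NumPy and Python 3.8 or later. The function names (`multi_indices`, `fit_coefficients`, `polynomial_ratio`) are hypothetical, and enumerating exponents with `itertools.product` is a convenience for small $p$ and low order, not the report's own procedure.

```python
import itertools
import math
import numpy as np

def multi_indices(p, order):
    """All exponent tuples (j_1, ..., j_p) with j_k >= 0 and J = sum(j) <= order."""
    return [j for j in itertools.product(range(order + 1), repeat=p)
            if sum(j) <= order]

def fit_coefficients(X, Y, sigma, order):
    """Coefficients a (Eq. 3.3) and b (Eq. 3.6) for every retained term."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, p = X.shape
    # Normalizing factor exp(-X_i'X_i / 2 sigma^2): computed once per observation.
    norm = np.exp(-np.sum(X ** 2, axis=1) / (2.0 * sigma ** 2))
    terms = multi_indices(p, order)
    a = np.empty(len(terms))
    b = np.empty(len(terms))
    for t, j in enumerate(terms):
        # Premultiplying constant 1 / (sigma^(2J) j_1! ... j_p!); not data-dependent.
        const = 1.0 / (sigma ** (2 * sum(j)) * math.prod(math.factorial(k) for k in j))
        cross = np.prod(X ** np.array(j), axis=1) * norm
        a[t] = const * np.mean(cross)        # the mean supplies the 1/n factor
        b[t] = const * np.mean(Y * cross)
    return terms, a, b

def polynomial_ratio(x, terms, a, b):
    """Evaluate Y*(x) = Q(x) / P(x) of Eq. (3.5) from the coefficients alone."""
    x = np.asarray(x, dtype=float)
    monomials = np.array([np.prod(x ** np.array(j)) for j in terms])
    return float(np.dot(b, monomials) / np.dot(a, monomials))
```

Once `fit_coefficients` has run, `polynomial_ratio` no longer needs the observations, which is the computational advantage the text attributes to the polynomial form. With data scaled so that $x'X_i/\sigma^2$ is small, a moderate order reproduces Eq. (2.4) closely.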


3.2  NORMALIZATION  OF  INPUT 

Since the accuracy of a finite-term Taylor's series expansion of $\exp(x'X/\sigma^2)$ depends
on the magnitude of $x'X/\sigma^2$, it is often desirable to make a transformation (translation)
from raw measurement vectors (e.g., $Z_i$ for the $i$th observation) to obtain a set
of vectors $X_i$ for which the Taylor approximation is satisfactory. Similarly, to
minimize distortion of the estimated density relative to the parent density and
simultaneously to minimize error due to the Taylor's series expansion, it is desirable to
"sphericalize" the data in some way. The simplest procedure consists of normalizing
each variate to have a variance of unity. If the variables are highly interdependent,
it may be desirable to use more complicated and specialized preprocessing techniques.
Preprocessing requirements are discussed at length in Ref. 6, but of course there
are no techniques which are "optimum" except for specific parent distributions.

In summary, if the means and standard deviations of the raw measurement variables
$Z_{ji}$ over the set of observations to be used for establishing the regression surface are
denoted by $\bar{Z}_j$ and $s_j$, respectively, the normalization usually required may be
expressed by

$$X_{ji} = (Z_{ji} - \bar{Z}_j)/s_j \tag{3.7}$$
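A minimal Python sketch of the normalization in Eq. (3.7), assuming NumPy, is given here. The function name `standardize` is hypothetical, and the use of the population standard deviation (NumPy's default, `ddof=0`) is an assumption, since the report does not say which divisor $s_j$ uses. The means and deviations are returned so that a new query vector can be transformed with the same constants as the fitting sample.

```python
import numpy as np

def standardize(Z, mean=None, std=None):
    """Apply X_ji = (Z_ji - mean_j) / s_j column by column (Eq. 3.7).

    Z is an (n, p) array of raw measurements. When mean and std are omitted
    they are computed from Z itself; pass stored values to transform new points.
    """
    Z = np.asarray(Z, dtype=float)
    if mean is None:
        mean = Z.mean(axis=0)
    if std is None:
        std = Z.std(axis=0)   # assumption: population standard deviation (ddof=0)
    return (Z - mean) / std, mean, std

# Fit the constants on the sample, then reuse them for a new raw vector.
Z_sample = np.array([[10.0, 200.0], [12.0, 240.0], [14.0, 220.0], [16.0, 260.0]])
X_sample, z_mean, z_std = standardize(Z_sample)
x_new, _, _ = standardize(np.array([[13.0, 230.0]]), z_mean, z_std)
print(X_sample, x_new)
```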


3.3  A  FIRST-ORDER  CORRECTION 

The regression $E[Y \mid X = x]$ represents that function $h(x)$ which minimizes the
mean-squared error, $E[Y - h(x)]^2$. However, even for large sample size, this objective
is not realized by either $\hat{Y}(x)$ or $Y^*(x)$ because of systematic distortion of the
estimated density which results when the smoothing parameter $\sigma$ is greater than zero.

In the case $n \to \infty$, the nature of this distortion is known since it is routine to show

$$E[\hat{f}(x)] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(X)\, g(x - X)\, dX = f(x) * g(x) \tag{3.8}$$

where $*$ indicates convolution,

$$g(x) = (2\pi\sigma^2)^{-p/2} \exp(-x'x/2\sigma^2)$$

and $f(x)$ is the probability density function of the distribution from which the sample
is drawn.

If, for example, $f$ is the normal distribution with mean $\mu$ and covariance $\Phi$, as
$n \to \infty$, $\hat{f}(x)$ converges to a normal distribution with mean $\mu$ but with a covariance
matrix of $[\Phi + \sigma^2 I]$, where $I$ is the identity matrix. Since a covariance of $(\sigma^2 I)$
represents a distribution in which the variates are completely uncorrelated, addition
of $\sigma^2 I$ to an arbitrary covariance $\Phi$ increases the variance terms with no effect on
the covariance terms. This has the effect of biasing the estimated density in the
direction of lower intercorrelations. Since the intercorrelations between the predicted
and predictor variables for the estimated density are characteristically less than these
same intercorrelations for the parent density, the predictions for $Y_i$ are
characteristically closer to the mean than they should be. This effect has been noted in
experience with real data.

As a simple but extreme example, consider the case of $Y$ and $X$ both normally
distributed with zero mean, unit variance, and correlation one. Applying Eq. (2.4),
it can be seen that as $n \to \infty$

$$\hat{Y}(x) = \frac{x}{\sigma^2 + 1} \tag{3.9}$$

In this example, $E[Y \mid X = x] = x$ whereas $\hat{Y}(x)$ is only proportional to $x$ and, as
predicted, biased toward the mean. Similarly, it can be shown that a purely
deterministic second-order component, $Y = AX^2$, is attenuated by the estimator $\hat{Y}(x)$ to
yield

$$\hat{Y}(x) = A\left[\frac{x^2}{(\sigma^2 + 1)^2} + \frac{\sigma^2}{\sigma^2 + 1}\right]$$
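The two large-sample limits quoted above can be checked numerically. The following Python sketch, assuming NumPy, draws a large standard normal sample, applies the weighted average of Eq. (2.4) with $Y = X$ and with $Y = X^2$ (that is, $A = 1$), and prints each result next to the corresponding closed-form value. The sample size, seed, query point, and $\sigma$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, x = 20000, 1.0, 1.0
X = rng.standard_normal(n)          # X ~ N(0, 1); Y is a deterministic function of X

# Weights of Eq. (2.4) for a one-dimensional predictor.
w = np.exp(-(X - x) ** 2 / (2.0 * sigma ** 2))

first_order = np.dot(w, X) / np.sum(w)          # case Y = X
second_order = np.dot(w, X ** 2) / np.sum(w)    # case Y = A X^2 with A = 1

# Closed-form large-sample values from the text.
first_limit = x / (sigma ** 2 + 1)
second_limit = x ** 2 / (sigma ** 2 + 1) ** 2 + sigma ** 2 / (sigma ** 2 + 1)

print(first_order, first_limit)
print(second_order, second_limit)
```

The simulated values agree with the limits only up to sampling error, which shrinks as `n` grows.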


As $\sigma \to 0$ both first- and second-order components are obtained without error, but
with finite $\sigma$ the error can be appreciable. However, the first-order scaling effect
and the constant bias can be completely compensated. Once Eq. (2.4) or (3.5) is used
to find a nonlinear relationship between $X$ and $Y$, the relationship between the
resulting scalar $\hat{Y}$ and $Y$ should be essentially linear. The best linear correction of
$\hat{Y}$ (in the least-squares sense) is, of course, obtained through simple linear regression
of $Y$ on $\hat{Y}$.
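Thus, a corrected estimate of $Y$ could be obtained by

$$Y(x) = \alpha_0 + \alpha_1 \hat{Y}(x) \tag{3.10}$$

where

$$\alpha_0 = \left\{ \left[\sum Y_i\right] \left[\sum (\hat{Y}(X_i))^2\right] - \left[\sum \hat{Y}(X_i)\right] \left[\sum Y_i \hat{Y}(X_i)\right] \right\} / D$$

$$\alpha_1 = \left\{ n \sum Y_i \hat{Y}(X_i) - \left[\sum Y_i\right] \left[\sum \hat{Y}(X_i)\right] \right\} / D$$

$$D = n \sum (\hat{Y}(X_i))^2 - \left[\sum \hat{Y}(X_i)\right]^2$$

and the summations run from $i = 1$ to $i = n$.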
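The correction constants of Eq. (3.10) are the ordinary least-squares intercept and slope for regressing $Y$ on $\hat{Y}$. A minimal Python sketch, assuming NumPy, follows; the function name `linear_correction` is hypothetical.

```python
import numpy as np

def linear_correction(Y, Y_hat):
    """Return (alpha_0, alpha_1) of Eq. (3.10) from observed Y_i and estimates Y_hat(X_i)."""
    Y = np.asarray(Y, dtype=float)
    Y_hat = np.asarray(Y_hat, dtype=float)
    n = len(Y)
    D = n * np.sum(Y_hat ** 2) - np.sum(Y_hat) ** 2
    alpha0 = (np.sum(Y) * np.sum(Y_hat ** 2)
              - np.sum(Y_hat) * np.sum(Y * Y_hat)) / D
    alpha1 = (n * np.sum(Y * Y_hat) - np.sum(Y) * np.sum(Y_hat)) / D
    return float(alpha0), float(alpha1)

# Usage: estimates that are attenuated by one half and shifted are mapped back.
Y_hat_demo = np.array([0.0, 0.5, 1.0, 1.5])
Y_demo = 2.0 * Y_hat_demo + 1.0
alpha0, alpha1 = linear_correction(Y_demo, Y_hat_demo)
corrected = alpha0 + alpha1 * Y_hat_demo
print(alpha0, alpha1, corrected)
```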
Section 4

COMPARISON  WITH  CONVENTIONAL  TECHNIQUES 


Nonlinear regression involves either a priori specification of the form of the regression
equation with subsequent statistical determination of some undetermined constants, or
statistical determination of the constants in a general regression equation, usually of
polynomial form. The advantages and disadvantages of both approaches are well
known. I will now point out some of the differences which distinguish the technique
described in this paper.

The first approach requires that the form of the regression equation be known a priori
or guessed. The advantage of this approach is that it usually reduces the problem to
estimation of a relatively small number of undetermined constants, and that the values
of these constants when found may provide some insight to the investigator. The
disadvantage is that the regression is constrained to yield a "best fit" for the specified
form of equation. If the specified form of equation is a poor guess and not actually
appropriate to the data base to which it is applied, this constraint can be serious.
Classical polynomial regression is usually limited to polynomials in one independent
variable because polynomials involving multiple variates often have too large a number
of free constants to be determined using a fixed number $n$ of observations $(X_i, Y_i)$.

A classical polynomial regression surface may fit the $n$ observed points very closely,
but unless $n$ is much larger than the number of coefficients in the polynomial, there
is no assurance that the error $(Y^* - Y)$ for a new point $(X, Y)$ taken randomly from
the distribution $f(x,y)$ will be small. On the other hand, with the regression Eq. (2.4)
it is possible to let $\sigma$ be small, which allows high-order curves if they are necessary
to fit the data, but even in the limit as $\sigma \to 0$ Eq. (2.4) does not go wild but merely
estimates $\hat{Y}(X)$ as being the same as the $Y_i$ associated with the $X_i$ which is closest
in Euclidean distance to $X$. Cover (Ref. 5) points out that, for a wide range of
probability distributions, the large-sample risk associated with estimation by this
"nearest-neighbor" rule is equal to only twice the Bayes risk (for squared error loss
functions). For any $\sigma > 0$ there is a smooth interpolation between the observed points
(as distinct from the discontinuous change of $\hat{Y}$ from one value to another at points
equidistant from the observed points when $\sigma = 0$).

Since Eqs. (2.4) and (3.5) are mathematically identical when the polynomials are not
truncated, the above statements are equally applicable to the polynomial regression
equation of the form Eq. (3.5). Equations (3.3) and (3.6) were derived in such a way
that hundreds or thousands of terms can be introduced into the polynomial regression
equation without overfitting the data even if the number of observed points is less than
the number of coefficients. It is important to note that actual identity of the polynomial
equation and Eq. (2.4) occurs when all possible terms are included in the polynomials.
The significant point is that the polynomial approximation tends to fit the estimated
regression surface given by Eq. (2.4), not the actual data. Equation (2.4), in turn,
employs a density estimator which involves smoothing of the data, thereby minimizing
the effects of randomness in sampling on the resulting regression surfaces. Thus, the
number of terms used in the polynomials is limited only to minimize computation;
there is no need to further limit this number because of any danger of overfitting the
data. As a practical matter, the computed polynomials can usually be truncated to a
low order with little degradation in the correlation between $\hat{Y}(X_i)$ and $Y_i$. If the
dimensionality $p$ is limited to 3 or 4, the number of terms can be held to a number
which is easily manageable.*

A general formula for the computation time involved is not available but, as an example,
one problem which has been run many times with different data has $p = 3$ and maximum
order of terms in the polynomials = 4. The average time required to compute
the 70 coefficients using 288 observations of $(X_{1i}, X_{2i}, X_{3i}, Y_i)$ and to evaluate $Y^*(x)$
for each of the 288 observations was 690 milliseconds on the Univac 1108 computer.

*The total number of terms in a polynomial truncated to include terms up to the $r$th order
is given by Sebestyen (Ref. 7) to be $\binom{p + r}{r}$.
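The term count in the footnote can be evaluated directly. The short Python sketch below (Python 3.8 or later; the function names are hypothetical) computes the binomial coefficient and cross-checks it by brute-force enumeration of the exponent tuples. For the timing example in the text ($p = 3$, maximum order 4) it gives 35 terms per polynomial, which is consistent with the 70 coefficients reported for the numerator and denominator polynomials together.

```python
import itertools
import math

def number_of_terms(p, r):
    """Number of monomials in p variables of total order at most r."""
    return math.comb(p + r, r)

def count_by_enumeration(p, r):
    """Brute-force count of exponent tuples (j_1, ..., j_p) with sum <= r."""
    return sum(1 for j in itertools.product(range(r + 1), repeat=p) if sum(j) <= r)

terms_per_polynomial = number_of_terms(3, 4)
print(terms_per_polynomial, 2 * terms_per_polynomial)
```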




LOCKHEED  PALO  ALTO  RESEARCH  LABORATORY 
LOCKHEED MISSILES & SPACE COMPANY
A GROUP DIVISION OF LOCKHEED AIRCRAFT CORPORATION




One additional feature of Eq. (2.4) is that $\hat{Y}(x)$ is always bounded by the maximum
and minimum values of the observed $Y_i$'s. In contrast, the classical polynomial
regression estimate goes to either $\infty$ or $-\infty$ as $x$ goes to $\pm\infty$. Surprisingly, the
polynomial-ratio regression estimate $Y^*(x)$ is also bounded if $P_\ell(x)$ and $Q_\ell(x)$ are
truncated to include only corresponding pairs of coefficients and enough terms are
retained so that the contribution of each observation to $P_\ell(x)$ is positive in the
range of $x$ of interest. The question of bounds on $Y^*(x)$ will be treated in more
detail in the Appendix.
Section 5

INDEPENDENT  VARIABLE  NOT  RANDOM 

It has been pointed out that although the motivation used in arriving at Eq. (2.4) was
based on both $X$ and $Y$ being random variables, the concept of using for $\hat{Y}(x)$ a
weighted average of the $Y_i$'s with the weight for each $Y_i$ being some monotonically
decreasing function of $|X_i - x|$ is attractive intuitively even if values of $X_i$ are
specified in the design of an experiment. In this case, only the values of $Y_i$
corresponding to the fixed values of $X_i$ would be measured observations of a random
variable. If for the monotonically decreasing weighting function we choose

$$\exp\left[-|X_i - x|^2/2\sigma^2\right]$$

then the regression equation is given by Eq. (2.4). Since the derivation of the polynomial
equivalent, Eq. (3.5), was not dependent on the assumption that $X$ is random, the
computational advantages of the polynomial form can be utilized even when $X$ is not
random, but instead values of $X$ are specified in the design of an experiment.



Section  6 
REFERENCES 

1.  E. Parzen, "On Estimation of a Probability Density Function and Mode," Ann.
Math. Statist., Vol. 33, 1962, pp. 1065-1076

2.  T. Cacoullos, "Estimation of a Multivariate Density," Ann. Inst. Statist. Math.,
Vol. 18, No. 2, 1966, Tokyo, pp. 179-189

3.  D. F. Specht, "Series Estimation of a Probability Density Function," 1968
(submitted for publication)

4.  D. F. Specht, "Generation of Polynomial Discriminant Functions for Pattern
Recognition," IEEE Trans. on Elec. Comp., Vol. EC-16, 1967, pp. 308-319

5.  T. M. Cover, Estimation by the Nearest-Neighbor Rule, SU-SEL-66-090
(TR 7002-1), Stanford Electronics Labs., Stanford, Calif., Sep 1966

6.  D. F. Specht, Generation of Polynomial Discriminant Functions for Pattern
Recognition, Ph.D. dissertation, Stanford University (also available as
SU-SEL-66-029, Stanford Electronics Labs., Stanford, Calif., May 1966, and as
Defense Documentation Center Report AD 487 537)

7.  G. S. Sebestyen, Decision-Making Processes in Pattern Recognition, New York,
Macmillan, 1962




Appendix 

BOUNDS  ON  Y*(x)  FROM  EQUATION  (3. 5) 


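From Eqs. (3.5) and (3.2)

$$Y^*(x) = \frac{\sum_{J \le \ell} b_{j_1 \cdots j_p}\, x_1^{j_1} \cdots x_p^{j_p}}{\sum_{J \le \ell} a_{j_1 \cdots j_p}\, x_1^{j_1} \cdots x_p^{j_p}} \tag{A.1}$$

Decomposing the coefficients into elements due to each observation,

$$a_{j_1 \cdots j_p} = \sum_{i=1}^{n} \alpha_{j_1 \cdots j_p i}, \qquad \alpha_{j_1 \cdots j_p i} = [n\sigma^{2J} j_1! \cdots j_p!]^{-1}\, X_{1i}^{j_1} \cdots X_{pi}^{j_p} \exp(-X_i'X_i/2\sigma^2)$$

$$b_{j_1 \cdots j_p} = \sum_{i=1}^{n} \beta_{j_1 \cdots j_p i}, \qquad \beta_{j_1 \cdots j_p i} = [n\sigma^{2J} j_1! \cdots j_p!]^{-1}\, Y_i X_{1i}^{j_1} \cdots X_{pi}^{j_p} \exp(-X_i'X_i/2\sigma^2) = Y_i\, \alpha_{j_1 \cdots j_p i}$$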
Then

$$Y^*(x) = \frac{\sum_{i=1}^{n} \left[\beta_{0 \cdots 0 i} + \beta_{10 \cdots 0 i}\, x_1 + \cdots + \beta_{j_1 \cdots j_p i}\, x_1^{j_1} \cdots x_p^{j_p} + \cdots\right]}{\sum_{i=1}^{n} \left[\alpha_{0 \cdots 0 i} + \alpha_{10 \cdots 0 i}\, x_1 + \cdots + \alpha_{j_1 \cdots j_p i}\, x_1^{j_1} \cdots x_p^{j_p} + \cdots\right]}$$

Let the expression contained in the first set of brackets be represented by $N_i^\infty(x)$ and
the expression contained in the second set of brackets be represented by $L_i^\infty(x)$. If
$N_i^\infty(x)$ and $L_i^\infty(x)$ are truncated to contain any arbitrary finite subsets of the original
terms but contain only corresponding pairs of terms, then

$$N_i(x) = Y_i L_i(x)$$

where $N_i(x)$ and $L_i(x)$ represent finite truncations of $N_i^\infty(x)$ and $L_i^\infty(x)$,
respectively, and

$$Y^*(x) = \frac{\sum_{i=1}^{n} Y_i L_i(x)}{\sum_{i=1}^{n} L_i(x)} \tag{A.2}$$

Thus, $Y^*(x)$ is always equivalent to a weighted average of the $Y_i$'s. The weightings
are, of course, dependent on the value of $x$ at which $Y^*(x)$ is to be evaluated.

When the weights $L_i(x)$ are all nonnegative, the weighted average $Y^*(x)$ is bounded
by the maximum and minimum values of $Y_i$ observed in the sample.
Since

$$L_i^\infty(x) = \frac{1}{n} \exp\left(-\frac{X_i'X_i}{2\sigma^2}\right) \left[1 + \frac{1}{\sigma^2} X_i'x + \frac{1}{2!\,\sigma^4}(X_i'x)^2 + \cdots + \frac{1}{J!\,\sigma^{2J}}(X_i'x)^J + \cdots\right] = \frac{1}{n} \exp\left[\frac{1}{\sigma^2}\left(X_i'x - \tfrac{1}{2} X_i'X_i\right)\right]$$

$L_i^\infty(x)$ is always positive in the untruncated case. Letting $z = X_i'x/\sigma^2$,

$$L_i(x) = \frac{1}{n} \exp(-X_i'X_i/2\sigma^2) \left[1 + z + \frac{z^2}{2!} + \cdots + \frac{z^\ell}{\ell!}\right]$$

Since $\exp(-X_i'X_i/2\sigma^2) > 0$ for any $X_i$, $L_i(x) \ge 0$ if

$$S_\ell(z) = 1 + z + \frac{z^2}{2!} + \cdots + \frac{z^\ell}{\ell!} \ge 0 \tag{A.3}$$

Differentiating term by term,

$$S_\ell(z) = \frac{z^\ell}{\ell!} + \frac{dS_\ell(z)}{dz} \tag{A.4}$$

Since the minimum value of $S_\ell(z)$ occurs when $dS_\ell(z)/dz = 0$,

$$\min S_\ell(z) = \frac{z^\ell}{\ell!} \quad \text{for some value of } z, \quad -\infty < z < \infty$$

Thus

$$\min S_\ell(z) \ge 0, \qquad \ell \text{ even} \tag{A.5}$$

and therefore $L_i(x) \ge 0$ for all $z$ when $P_\ell(x)$ and $Q_\ell(x)$ are truncated to include
all terms up to the $\ell$th order and $\ell$ is even.
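The sign behavior of the truncated exponential series can be illustrated numerically. The Python sketch below, assuming NumPy, evaluates the partial sum $S_\ell(z)$ on a grid and prints its minimum for several orders; the function name `truncated_exp` and the grid limits are illustrative choices. On this grid the minimum stays positive for even orders and becomes negative for odd orders, in line with the argument above. A grid check is only an illustration, not a proof.

```python
import math
import numpy as np

def truncated_exp(z, order):
    """Partial sum S_l(z) = 1 + z + z^2/2! + ... + z^l/l! evaluated elementwise."""
    z = np.asarray(z, dtype=float)
    return sum(z ** k / math.factorial(k) for k in range(order + 1))

z_grid = np.linspace(-20.0, 20.0, 4001)
minima = {order: float(truncated_exp(z_grid, order).min()) for order in range(1, 9)}
for order, value in minima.items():
    print(order, value)
```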


•i 
Let $X$ be a $p$-component random vector variable with transpose $X' = [X_1, \ldots, X_j, \ldots, X_p]$ and $Y$
be a one-dimensional random variable with a joint continuous distribution of density $f(x,y)$. The
regression of $Y$ given $X = x$ is

$$E[Y \mid X = x] = \frac{\int_{-\infty}^{\infty} y\, f(x,y)\, dy}{\int_{-\infty}^{\infty} f(x,y)\, dy} \tag{1}$$

Using consistent nonparametric estimators of the densities, $E[Y \mid X = x]$ can be approximated by
a ratio of two general polynomials where the coefficients of the polynomials are computed as a
function of the observed sample. The technique described has the following advantages compared
with classical polynomial regression techniques utilizing a single polynomial: 1. The shape of the
regression surface can be made as complex as necessary to closely approximate Eq. (1) or as simple
as desired, by proper choice of a single parameter $\sigma$ (called the smoothing parameter). In
spite of this flexibility, the estimated value is bounded by the minimum and maximum of the
observations $Y_i$ when the polynomials are truncated at an even order. 2. Because of the smoothing
properties inherent in the density estimator, the number of coefficients used in the polynomials
can approach or even exceed the number of observations in the sample with no danger of the
regression surface overfitting the data when $\sigma$ is suitably chosen. 3. A simple method of
determining the coefficients has been derived. The calculation for the coefficient of a particular term
amounts to little more than averaging the corresponding product of variables over the set of
observations available. 4. The computational and storage requirements increase only linearly with
the number of coefficients used. The computational advantages of the technique can be utilized
also when values of $X$ are specified in the design of an experiment and $Y$ alone is a random
variable.